Unbiased split selection for classification trees based on the Gini Index

نویسندگان

  • Carolin Strobl
  • Anne-Laure Boulesteix
  • Thomas Augustin
چکیده

The Gini gain is one of the most common variable selection criteria in machine learning. We derive the exact distribution of the maximally selected Gini gain in the context of binary classification using continuous predictors by means of a combinatorial approach. This distribution provides a formal support for variable selection bias in favor of variables with a high amount of missing values when the Gini gain is used as split selection criterion, and we suggest to use the resulting p-value as an unbiased split selection criterion in recursive partitioning algorithms. We demonstrate the efficiency of our novel method in simulationand real datastudies from veterinary gynecology in the context of binary classification and continuous predictor variables with different numbers of missing values. Our method is extendible to categorical and ordinal predictor variables and to other split selection criteria such as the cross-entropy criterion.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Statistical Sources of Variable Selection Bias in Classification Tree Algorithms Based on the Gini Index

Evidence for variable selection bias in classification tree algorithms based on the Gini Index is reviewed from the literature and embedded into a broader explanatory scheme: Variable selection bias in classification tree algorithms based on the Gini Index can be caused not only by the statistical effect of multiple comparisons, but also by an increasing estimation bias and variance of the spli...

متن کامل

Application of Different Methods of Decision Tree Algorithm for Mapping Rangeland Using Satellite Imagery (Case Study: Doviraj Catchment in Ilam Province)

Using satellite imagery for the study of Earth's resources is attended by manyresearchers. In fact, the various phenomena have different spectral response inelectromagnetic radiation. One major application of satellite data is the classification ofland cover. In recent years, a number of classification algorithms have been developed forclassification of remote sensing data. One of the most nota...

متن کامل

A bias correction algorithm for the Gini variable importance measure in classification trees

This paper considers a measure of variable importance frequently used in variable selection methods based on decision trees and tree-based ensemble models, like CART, Random Forests and Gradient Boosting Machine. It is defined as the total heterogeneity reduction produced by a given covariate on the response variable when the sample space is recursively partitioned. Some authors showed that thi...

متن کامل

Unifying Decision Trees Split Criteria Using Tsallis Entropy

The construction of efficient and effective decision trees remains a key topic in machine learning because of their simplicity and flexibility. A lot of heuristic algorithms have been proposed to construct near-optimal decision trees. Most of them, however, are greedy algorithms which have the drawback of obtaining only local optimums. Besides, common split criteria, e.g. Shannon entropy, Gain ...

متن کامل

Classification Trees With Unbiased Multiway Splits

Two univariate split methods and one linear combination split method are proposed for the construction of classification trees with multiway splits. Examples are given where the trees are more compact and hence easier to interpret than binary trees. A major strength of the univariate split methods is that they have negligible bias in variable selection, both when the variables differ in the num...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Computational Statistics & Data Analysis

دوره 52  شماره 

صفحات  -

تاریخ انتشار 2007